Lossless Seeds for Searching Short Patterns with High Error Rates
Abstract
We address the problem of approximate pattern matching under the Levenshtein distance: given a text T and a pattern P, find all locations in T that differ from P by at most k errors. For this purpose, we propose a filtration algorithm based on a novel type of seed that combines exact parts with parts containing a fixed number of errors. Experimental tests show that the method is particularly well suited to short patterns with a large number of errors.
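To make the problem statement concrete, the following is a minimal sketch of the classic dynamic-programming solution (Sellers' algorithm) for finding all positions in T where a match with at most k errors ends. It is not the seed-based filtration proposed in the paper; it is the kind of exhaustive verification that a filter is meant to avoid running over the whole text. The function name `approximate_matches` and its return convention (exclusive end positions) are assumptions made for illustration.

```python
def approximate_matches(text, pattern, k):
    """Return the end positions (exclusive) in `text` at which some substring
    within Levenshtein distance `k` of `pattern` ends.

    Hypothetical helper for illustration; a plain O(|T| * |P|) baseline,
    not the filtration algorithm described in the abstract.
    """
    m = len(pattern)
    # column[i] holds the minimum edit distance between pattern[:i] and
    # any substring of the text ending at the current text position.
    column = list(range(m + 1))
    ends = []
    if column[m] <= k:
        ends.append(0)  # the empty substring already matches when m <= k
    for j, ch in enumerate(text, start=1):
        prev_diag = column[0]  # value of row i-1 at the previous text position
        column[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(
                prev_diag + cost,   # match or substitution
                column[i] + 1,      # extra character in the text
                column[i - 1] + 1,  # missing character in the text
            )
            prev_diag = column[i]
            column[i] = new_value
        if column[m] <= k:
            ends.append(j)
    return ends
```

For example, `approximate_matches("abcdefg", "cde", 0)` returns `[5]`, while allowing one error with `k = 1` returns `[4, 5, 6]`, since "cd", "cde" and "cdef" are all within one edit of the pattern.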
Similar resources
Approximate search of short patterns with high error rates using the 01⁎0 lossless seeds
Article history: Available online 14 March 2016
Seed-Set Construction by Equi-entropy Partitioning for Efficient and Sensitive Short-Read Mapping
Spaced seeds have been shown to be superior to continuous seeds for efficient and sensitive homology search based on the seed-and-extend paradigm. Much the same is true in genome mapping of high-throughput short-read data. However, a highly sensitive search with multiple spaced patterns often requires the use of a great amount of index data. We propose a novel seed-set construction method for ef...
Spaced Seeds Design Using Perfect Rulers
We consider the problem of lossless spaced seed design for approximate pattern matching. We show that, using mathematical objects known as perfect rulers, we can derive a family of spaced seeds for matching with up to two errors. We analyze these seeds with respect to the trade-off they offer between seed weight and the minimum length of the pattern to be matched. We prove that for patterns of ...
New Methods for Lossless Image Compression Using Arithmetic Coding
Lossless image compression presents a unique set of challenges. Considerable research has already been done on lossless text compression [1,2,3,4,5]; all good methods found to date involve some form of moderately high-order exact string matching. However, this work cannot easily be carried over to lossless image compression, for two reasons: First, images are two-dimensional, so the contexts ar...
Multi-seed Lossless Filtration (Extended Abstract)
We study a method of seed-based lossless filtration for approximate string matching and related applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Karkkainen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques t...